Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System
Fang, Yunhua, Xie, Rui, Haq, Asad Ul, Ma, Linsen, Maghraoui, Kaoutar El, Wang, Naigang, Wang, Meng, Liu, Liu, Zhang, Tong
Abstract--Large Language Model (LLM) inference is increasingly constrained by memory bandwidth, with frequent access to the key-value (KV) cache dominating data movement. While attention sparsity reduces some memory traffic, the relevance of past tokens varies over time, requiring the full KV cache to remain accessible and sustaining pressure on both bandwidth and capacity. This work investigates dynamic KV cache placement across heterogeneous memory systems to maximize aggregated bandwidth utilization under capacity constraints. Rather than proposing a specific scheduling policy, we formulate the placement problem mathematically and derive a theoretical upper bound, revealing substantial headroom for runtime optimization. To our knowledge, this is the first formal treatment of dynamic KV cache scheduling in heterogeneous memory systems for LLM inference.
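As a rough illustration of the placement problem described above (a sketch under assumed parameters, not the paper's formulation): when two memory tiers are read in parallel, total KV read time is roughly minimized by splitting bytes in proportion to tier bandwidth, capped by the fast tier's capacity. The tier sizes, bandwidths, and block granularity below are illustrative assumptions.

```python
# Hedged sketch: split KV-cache blocks across a fast and a slow memory tier so that
# concurrent streaming approximates the aggregated bandwidth, subject to capacity.
def place_kv_blocks(block_bytes, n_blocks, fast_cap_bytes, bw_fast, bw_slow):
    """Return (n_fast, n_slow): how many KV blocks go to each tier.

    If both tiers stream concurrently, read time is max(bytes_fast / bw_fast,
    bytes_slow / bw_slow); it is minimized by a bandwidth-proportional split,
    capped by the fast tier's capacity.
    """
    total_bytes = block_bytes * n_blocks
    ideal_fast = total_bytes * bw_fast / (bw_fast + bw_slow)  # bandwidth-proportional split
    fast_bytes = min(ideal_fast, fast_cap_bytes)              # capacity constraint
    n_fast = int(fast_bytes // block_bytes)
    return n_fast, n_blocks - n_fast

# Example: 4 GiB of KV blocks, 2 GiB of HBM-like fast memory, 1 TB/s vs 100 GB/s.
n_fast, n_slow = place_kv_blocks(2**20, 4096, 2 * 2**30, 1000e9, 100e9)
print(n_fast, n_slow)
```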
Exploring the Dynamic Scheduling Space of Real-Time Generative AI Applications on Emerging Heterogeneous Systems
Karami, Rachid, Patwari, Rajeev, Kwon, Hyoukjun, Sirasao, Ashish
The integration of generative AI models, particularly large language models (LLMs), into real-time multi-model AI applications such as video conferencing and gaming is giving rise to a new class of workloads: real-time generative AI (RTGen). These workloads combine the compute intensity and dynamic execution patterns of generative models with the stringent latency and concurrency constraints of real-time inference. To meet the diverse demands of RTGen workloads, modern edge platforms increasingly adopt heterogeneous system-on-chip (SoC) architectures that integrate CPUs, GPUs, and NPUs. Despite the potential of heterogeneous SoCs, the scheduling space complexity and performance implications of RTGen workloads on such platforms remain underexplored. In this work, we perform a comprehensive characterization of RTGen workloads on AMD's latest heterogeneous SoC, Ryzen AI. We construct realistic multi-model scenarios inspired by industry use cases and profile model performance across all available backends. Using this data, we evaluate five scheduling policies and their impact on both real-time metrics (e.g., deadline violation rate) and LLM performance (e.g., time-to-first-token and tokens-per-second). Our results show that scheduling decisions significantly affect workload performance (e.g., leading to a 41.7% difference in deadline violation rates on average), and highlight the need for scheduling strategies that are aware of workload dynamics and hardware heterogeneity. Our findings underscore the importance of workload-aware, dynamic heterogeneous scheduling in enabling high-performance, on-device RTGen applications.
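As a small illustration of one real-time metric mentioned above, the sketch below computes a deadline violation rate for a toy earliest-deadline-first schedule on a single accelerator; the task fields, latencies, and single-queue assumption are illustrative and not the paper's experimental setup.

```python
# Hedged sketch: deadline violation rate for one scheduling policy (EDF) on one backend.
from dataclasses import dataclass

@dataclass
class Task:
    release_ms: float   # when the inference request arrives
    runtime_ms: float   # profiled latency on the chosen backend
    deadline_ms: float  # absolute deadline

def edf_violation_rate(tasks):
    """Run tasks in earliest-deadline-first order and count missed deadlines."""
    clock, violations = 0.0, 0
    for t in sorted(tasks, key=lambda t: t.deadline_ms):
        clock = max(clock, t.release_ms) + t.runtime_ms
        violations += clock > t.deadline_ms
    return violations / len(tasks)

tasks = [Task(0, 12, 33), Task(5, 20, 40), Task(8, 15, 45)]
print(f"deadline violation rate: {edf_violation_rate(tasks):.0%}")
```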
- North America > United States > California > Orange County > Irvine (0.04)
- Europe (0.04)
- Asia > Middle East > Saudi Arabia > Northern Borders Province > Arar (0.04)
- Semiconductors & Electronics (0.49)
- Information Technology (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Generation (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.82)
AccLLM: Accelerating Long-Context LLM Inference Via Algorithm-Hardware Co-Design
Liang, Yanbiao, Shi, Huihong, Shao, Haikuo, Wang, Zhongfeng
--Recently, large language models (LLMs) have achieved huge success in the natural language processing (NLP) field, driving a growing demand to extend their deployment from the cloud to edge devices. However, deploying LLMs on resource-constrained edge devices poses significant challenges, including (1) intensive computations and huge model sizes, (2) great memory and bandwidth demands introduced by the autoregressive generation process, and (3) limited scalability for handling long sequences. To address these challenges, we propose AccLLM, a comprehensive acceleration framework that enables efficient and fast long-context LLM inference through algorithm and hardware co-design. At the algorithmic level, we integrate (1) pruning, (2) Λ-shaped attention, and (3) an innovative W2A8KV4 (2-bit weights, 8-bit activations, and 4-bit KV cache) quantization scheme, thus effectively reducing memory and bandwidth requirements while facilitating LLMs' long-sequence generation. At the hardware level, we design a dedicated FPGA-based accelerator with a reconfigurable computing engine to effectively and flexibly accommodate diverse operations arising from our compression algorithm, thereby fully translating the algorithmic innovations into tangible hardware efficiency. Large language models [1]-[4] (LLMs) have revolutionized natural language processing (NLP) with their outstanding capabilities, enabling a wide range of applications [5], including code generation [6], document summarization [7], chatbots [2], and question answering [8]. This impressive potential has driven growing interest in extending LLMs' deployment beyond traditional cloud-based platforms to edge devices, such as smart vehicles, robots, and embedded systems [9]-[11]. However, mainstream works have merely focused on optimizing and accelerating LLMs on GPUs [12], [13] with powerful resource capacity, making them unsuitable for resource-constrained edge scenarios [14], [15]. This work was supported by the National Key R&D Program of China under Grant 2022YFB4400600. Zhongfeng Wang is with the School of Electronic Science and Engineering, Nanjing University, and the School of Integrated Circuits, Sun Yat-sen University (email: zfwang@nju.edu.cn). Correspondence should be addressed to Zhongfeng Wang.
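To make the KV-cache part of the W2A8KV4 scheme concrete, here is a minimal sketch of 4-bit KV quantization; the symmetric per-token scaling below is an illustrative choice, not necessarily AccLLM's exact method.

```python
# Hedged sketch: per-token symmetric 4-bit quantization of cached K/V vectors.
import numpy as np

def quantize_kv4(kv: np.ndarray):
    """Quantize each row (one token's K or V vector) to signed 4-bit integers."""
    scale = np.abs(kv).max(axis=-1, keepdims=True) / 7.0 + 1e-8  # int4 range: [-8, 7]
    q = np.clip(np.round(kv / scale), -8, 7).astype(np.int8)     # packed as 4-bit in practice
    return q, scale

def dequantize_kv4(q, scale):
    return q.astype(np.float32) * scale

kv = np.random.randn(4, 128).astype(np.float32)  # 4 cached tokens, head_dim = 128
q, s = quantize_kv4(kv)
print("max abs error:", np.abs(dequantize_kv4(q, s) - kv).max())
```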
- Asia > China > Jiangsu Province > Nanjing (0.24)
- Asia > Middle East > Israel (0.04)
- Education (0.54)
- Semiconductors & Electronics (0.35)
- Information Technology (0.34)
Hybrid Offline-online Scheduling Method for Large Language Model Inference Optimization
Pang, Bowen, Li, Kai, She, Ruifeng, Wang, Feifan
--With the development of large language models (LLMs), it has become increasingly important to optimize hardware usage and improve throughput. In this paper, we study the inference optimization of the serving system that deploys LLMs. To optimize system throughput and maximize hardware utilization, we formulate the inference optimization problem as a mixed-integer programming (MIP) model and propose a hybrid offline-online method as solution. The offline method improves large-scale inference systems by introducing a Minimizing Makespan Bin Packing Problem. We further provide a theoretical lower bound computation method. Then, we propose an online sorting and preemptive scheduling method to better utilize hardware. In the online iteration scheduling process, a Lagrangian method is applied to evaluate the cost efficiency of inserting prefill stages versus decode stages at each iteration and dynamically determine when to preempt decoding tasks and insert prefill tasks. Experiments using real-world data from the LLaMA-65B model and the GSM8K dataset demonstrate that system utilization improves from 80.2% to 89.1%, and the total inference time decreases from 201.00 to 190.58 seconds. A 100-case study shows that our method consistently outperforms the baseline method and improves the utilization rate by 8.0% on average. Finally, we discuss potential future extensions, including stochastic modeling, reinforcement learning-based schedulers, and dynamic decision-making strategies for system throughput and hardware utilization. Note to Practitioners --This work provides optimization tools for enhancing the efficiency of LLM inference systems through advanced scheduling techniques. From the perspective of LLM inference service providers, improved hardware utilization can reduce operational costs by requiring less hardware to maintain the same level of service. From the user's perspective, reduced inference time translates to faster response times and improved service quality. Furthermore, the proposed scheduling techniques are adaptable to various LLM models, hardware platforms, and datasets, making them highly scalable and broadly applicable to real-world LLM inference scenarios. Recent advancements in large language models (LLMs), including GPT-4, LLaMA, and Qwen, have significantly transformed the landscape of natural language processing by enabling more sophisticated text generation, comprehension, and interaction capabilities. These models serve as foundational technologies in a wide range of applications, such as chatbots, machine translation, and content creation. The authors are with Noah's Ark Lab, Huawei.
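The per-iteration prefill-versus-decode decision can be pictured as a simple cost comparison; the sketch below uses illustrative cost terms and a Lagrange-multiplier-like weight as stand-ins for the paper's actual Lagrangian formulation.

```python
# Hedged sketch of the kind of per-iteration decision an online scheduler makes:
# compare the cost of stalling in-flight decodes against the cost of postponing a prefill.
def should_insert_prefill(prefill_ms, n_decoding, decode_step_ms, lam=1.0):
    """Insert the waiting prefill if postponing it costs more than the decode stall it causes.

    - prefill_ms: estimated runtime of the candidate prefill batch
    - n_decoding: number of decode requests that would be stalled
    - decode_step_ms: per-iteration decode latency
    - lam: Lagrange-multiplier-like weight trading throughput for decode latency
    """
    decode_delay_cost = lam * n_decoding * prefill_ms              # all decoders wait for the prefill
    prefill_wait_cost = prefill_ms + n_decoding * decode_step_ms   # cost of waiting one more decode step
    return prefill_wait_cost >= decode_delay_cost

print(should_insert_prefill(prefill_ms=30, n_decoding=4, decode_step_ms=25))
```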
- Asia > China > Beijing > Beijing (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > China > Zhejiang Province > Hangzhou (0.04)
- (4 more...)
ProMoE: Fast MoE-based LLM Serving using Proactive Caching
Song, Xiaoniu, Zhong, Zihang, Chen, Rong
The promising applications of large language models are often constrained by the limited GPU memory capacity available on edge devices. Mixture-of-Experts (MoE) models help mitigate this issue by activating only a subset of the model's parameters during computation, allowing the unused parameters to be offloaded to host memory and reducing overall GPU memory demand. However, existing cache-based offloading solutions handle cache misses reactively and significantly impact system performance. In this paper, we propose ProMoE, a novel proactive caching system that leverages intermediate model results to predict subsequent parameter usage. By proactively fetching experts in advance, ProMoE removes the loading time from the critical path and diminishes the performance overhead of offloading. Our evaluations demonstrate that ProMoE achieves an average speedup of 2.13x and 2.84x in the prefill and decode stages respectively, compared to existing offloading solutions.
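The proactive-caching idea can be sketched as a predictor plus an asynchronous host-to-GPU copy; the frequency-based predictor and helper names below are illustrative stand-ins for ProMoE's use of intermediate model results.

```python
# Hedged sketch: prefetch the experts most likely to be activated next, so the
# host-to-GPU copy overlaps with computation instead of sitting on the critical path.
import collections, threading

class ExpertPrefetcher:
    def __init__(self, load_fn, gpu_cache):
        self.load_fn = load_fn          # copies one expert's weights to GPU memory
        self.gpu_cache = gpu_cache      # dict: expert_id -> resident weights
        self.history = collections.Counter()

    def record(self, expert_ids):       # observe experts chosen at the current layer
        self.history.update(expert_ids)

    def prefetch(self, k=2):            # asynchronously fetch the k most likely experts
        for eid, _ in self.history.most_common(k):
            if eid not in self.gpu_cache:
                threading.Thread(target=self._fetch, args=(eid,), daemon=True).start()

    def _fetch(self, eid):
        self.gpu_cache[eid] = self.load_fn(eid)

pref = ExpertPrefetcher(load_fn=lambda eid: f"weights[{eid}]", gpu_cache={})
pref.record([3, 3, 7]); pref.prefetch(k=1)
```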
- North America > United States > New York > New York County > New York City (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
- North America > United States > California > Santa Clara County > Santa Clara (0.04)
- (4 more...)
Demystifying Platform Requirements for Diverse LLM Inference Use Cases
Bambhaniya, Abhimanyu, Raj, Ritik, Jeong, Geonhwa, Kundu, Souvik, Srinivasan, Sudarshan, Elavazhagan, Midhilesh, Kumar, Madhu, Krishna, Tushar
Large language models (LLMs) have shown remarkable performance across a wide range of applications, often outperforming human experts. However, deploying these parameter-heavy models efficiently for diverse inference use cases requires carefully designed hardware platforms with ample computing, memory, and network resources. With LLM deployment scenarios and models evolving at breakneck speed, the hardware requirements to meet SLOs remain an open research question. In this work, we present an analytical tool, GenZ, to study the relationship between LLM inference performance and various platform design parameters. Our analysis provides insights into configuring platforms for different LLM workloads and use cases. We quantify the platform requirements to support SOTA LLMs like LLaMA and GPT-4 under diverse serving settings. Furthermore, we project the hardware capabilities needed to enable future LLMs potentially exceeding hundreds of trillions of parameters. The trends and insights derived from GenZ can guide AI engineers deploying LLMs as well as computer architects designing next-generation hardware accelerators and platforms. Ultimately, this work sheds light on the platform design considerations for unlocking the full potential of large language models across a spectrum of applications. The source code is available at https://github.com/abhibambhaniya/GenZ-LLM-Analyzer.
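As a back-of-the-envelope example of the kind of estimate an analytical tool produces (this is not GenZ's API), decode is typically memory-bound, so per-token latency can be approximated by bytes moved over memory bandwidth; the model size, precision, KV-cache footprint, and bandwidth below are illustrative assumptions.

```python
# Hedged sketch: first-order, memory-bound estimate of decode latency per token.
def decode_token_latency_ms(params_b, bytes_per_param, kv_bytes, mem_bw_gbps):
    bytes_moved = params_b * 1e9 * bytes_per_param + kv_bytes  # weights + KV cache read per token
    return bytes_moved / (mem_bw_gbps * 1e9) * 1e3

# Example: a 70B-parameter model in FP16 with 5 GB of KV cache on a 3.35 TB/s HBM platform.
print(f"{decode_token_latency_ms(70, 2, 5e9, 3350):.1f} ms/token")
```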
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs
Zeng, Shulin, Liu, Jun, Dai, Guohao, Yang, Xinhao, Fu, Tianyu, Wang, Hongyi, Ma, Wenheng, Sun, Hanbo, Li, Shiyao, Huang, Zixiao, Dai, Yadong, Li, Jintao, Wang, Zehao, Zhang, Ruoyu, Wen, Kairui, Ning, Xuefei, Wang, Yu
Transformer-based Large Language Models (LLMs) have made a significant impact on various domains. However, LLMs' efficiency suffers from both heavy computation and memory overheads. Compression techniques like sparsification and quantization are commonly used to mitigate the gap between LLM's computation/memory overheads and hardware capacity. However, existing GPU and transformer-based accelerators cannot efficiently process compressed LLMs, due to the following unresolved challenges: low computational efficiency, underutilized memory bandwidth, and large compilation overheads. This paper proposes FlightLLM, enabling efficient LLM inference with a complete mapping flow on FPGAs. In FlightLLM, we highlight the innovative insight that the computation and memory overheads of LLMs can be addressed by utilizing FPGA-specific resources (e.g., DSP48 and heterogeneous memory hierarchy). First, we propose a configurable sparse DSP chain to support different sparsity patterns with high computation efficiency. Second, we propose an always-on-chip decode scheme to boost memory bandwidth with mixed-precision support. Finally, to make FlightLLM available for real-world LLMs, we propose a length adaptive compilation method to reduce the compilation overhead. Implemented on the Xilinx Alveo U280 FPGA, FlightLLM achieves 6.0$\times$ higher energy efficiency and 1.8$\times$ better cost efficiency against commercial GPUs (e.g., NVIDIA V100S) on modern LLMs (e.g., LLaMA2-7B) using vLLM and SmoothQuant under the batch size of one. FlightLLM beats NVIDIA A100 GPU with 1.2$\times$ higher throughput using the latest Versal VHK158 FPGA.
- North America > United States > California > Monterey County > Monterey (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
- (5 more...)
- Information Technology (0.55)
- Semiconductors & Electronics (0.35)
Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference
Chen, Hongzheng, Zhang, Jiahao, Du, Yixiao, Xiang, Shaojie, Yue, Zichao, Zhang, Niansong, Cai, Yaohui, Zhang, Zhiru
Recent advancements in large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads. The majority of existing approaches rely on temporal architectures that reuse hardware units for different network layers and operators. However, these methods often encounter challenges in achieving low latency due to considerable memory access overhead. This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on FPGAs. Our approach involves the specialization of distinct hardware units for specific operators or layers, facilitating direct communication between them through a dataflow architecture while minimizing off-chip memory accesses. We introduce a comprehensive analytical model for estimating the performance of a spatial LLM accelerator, taking into account the on-chip compute and memory resources available on an FPGA. Through our analysis, we can determine the scenarios in which FPGA-based spatial acceleration can outperform its GPU-based counterpart. To enable more productive implementations of an LLM model on FPGAs, we further provide a library of high-level synthesis (HLS) kernels that are composable and reusable. This library will be made available as open-source. To validate the effectiveness of both our analytical model and HLS library, we have implemented BERT and GPT2 on an AMD Alveo U280 FPGA device. Experimental results demonstrate our approach can achieve up to 16.1x speedup when compared to previous FPGA-based accelerators for the BERT model. For GPT generative inference, we attain a 2.2x speedup compared to DFX, an FPGA overlay, in the prefill stage, while achieving a 1.9x speedup and a 5.7x improvement in energy efficiency compared to the NVIDIA A100 GPU in the decode stage.
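To illustrate the kind of reasoning such an analytical model supports (a sketch, not the paper's model), the decode stage of a spatial design that keeps a layer's weights on-chip is bounded by compute throughput, while a temporal design streaming weights from off-chip memory is bounded by bandwidth; the hardware numbers below are illustrative assumptions.

```python
# Hedged sketch: roofline-style comparison of per-token latency for one layer,
# spatial (weights on-chip) vs. temporal (weights streamed from off-chip memory).
def decode_latency_s(flops, weight_bytes, peak_flops, offchip_bw, weights_on_chip):
    compute_s = flops / peak_flops
    memory_s = 0.0 if weights_on_chip else weight_bytes / offchip_bw
    return max(compute_s, memory_s)  # latency is set by the tighter of the two bounds

# Toy layer: 2 * 4096 * 4096 FLOPs per token, FP16 weights.
flops, wbytes = 2 * 4096 * 4096, 4096 * 4096 * 2
print("spatial :", decode_latency_s(flops, wbytes, 20e12, 460e9, True))
print("temporal:", decode_latency_s(flops, wbytes, 20e12, 460e9, False))
```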
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Massachusetts > Suffolk County > Boston (0.14)
- North America > United States > New York > New York County > New York City (0.05)
- (6 more...)